Description

Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would cause the bank a loss, so the bank wants to analyze its customer data, identify the customers who will leave the credit card service, and understand their reasons, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards. You need to identify the best possible model that gives the required performance.

Data Description

• CLIENTNUM: Client number. Unique identifier for the customer holding the account
• Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
• Customer_Age: Age in Years
• Gender: Gender of the account holder
• Dependent_count: Number of dependents
• Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
• Marital_Status: Marital Status of the account holder
• Income_Category: Annual Income Category of the account holder
• Card_Category: Type of Card
• Months_on_book: Period of relationship with the bank
• Total_Relationship_Count: Total no. of products held by the customer
• Months_Inactive_12_mon: No. of months inactive in the last 12 months
• Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
• Credit_Limit: Credit Limit on the Credit Card
• Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
• Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
• Total_Trans_Amt: Total Transaction Amount (Last 12 months)
• Total_Trans_Ct: Total Transaction Count (Last 12 months)
• Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
• Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
• Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Importing Libraries

Loading Data

Data Overview

Observations

Observations

Data Pre-Processing

Modify the Target variable to Numeric

Function to convert "Existing Customer" to 1 and "Attrited Customer" to 0
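
The conversion described above can be sketched as follows. The mapping (Existing Customer to 1, Attrited Customer to 0) is taken from the heading; the function name `encode_attrition` and the three-row frame are hypothetical stand-ins, since the original code cell is not shown.

```python
import pandas as pd

# Mapping taken from the heading above: Existing -> 1, Attrited -> 0.
FLAG_TO_INT = {"Existing Customer": 1, "Attrited Customer": 0}


def encode_attrition(flag: str) -> int:
    """Return the numeric code for an Attrition_Flag label."""
    # A KeyError here signals an unexpected label in the column.
    return FLAG_TO_INT[flag]


# Hypothetical rows standing in for the real data set.
df = pd.DataFrame(
    {"Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"]}
)
df["Attrition_Flag"] = df["Attrition_Flag"].apply(encode_attrition)
print(df["Attrition_Flag"].tolist())
```

With this encoding the attrited customers are class 0, which matters later when a metric needs to know which class counts as "positive".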

EDA

Univariate analysis

Observations on Customer_Age

Observations on Months_Inactive_12_mon

Observation on Months_on_book

Observation on Contacts_Count_12_mon

Observation on Credit_Limit

Observation on Total_Revolving_Bal

Observation on Total_Revolving_Bal

Observation on Avg_Open_To_Buy

Observation on Total_Trans_Amt

Observation on Total_Trans_Ct

Observation on Avg_Utilization_Ratio

Observations on Gender

Observations on Dependent_count

Observations on Education_Level

Observations on Marital_Status

Observations on Income_Category

Observations on Card_Category

Observations on Total_Relationship_Count

Bivariate Analysis

Attrition_Flag vs Income_Category

Attrition_Flag vs Total_Relationship_Count

Attrition_Flag vs Months_on_book

Attrition_Flag vs Customer_Age

Attrition_Flag vs Avg_Open_To_Buy

Data Preparation for Modeling

Imputing Missing Values
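
The imputation cell itself is not shown, so the columns and strategy used in the notebook are unknown. As a minimal sketch, assuming the missing values sit in categorical columns and that most-frequent-value imputation is acceptable, scikit-learn's `SimpleImputer` can be fitted on the training split only and then applied to the test split. The toy frames and the choice of columns below are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test splits with gaps in two categorical columns.
X_train = pd.DataFrame(
    {
        "Education_Level": ["Graduate", np.nan, "Graduate", "High School"],
        "Marital_Status": ["Married", "Single", np.nan, "Married"],
    }
)
X_test = pd.DataFrame(
    {
        "Education_Level": [np.nan, "Doctorate"],
        "Marital_Status": ["Single", np.nan],
    }
)

cols = ["Education_Level", "Marital_Status"]
imputer = SimpleImputer(strategy="most_frequent")

# Learn the most frequent value per column from the training data only,
# then reuse those values on the test data to avoid leakage.
X_train[cols] = imputer.fit_transform(X_train[cols])
X_test[cols] = imputer.transform(X_test[cols])
```

Fitting on the training split alone keeps information from the test set out of the preprocessing step.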

Creating Dummy Variables
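
A minimal sketch of one-hot encoding with `pandas.get_dummies`, assuming the categorical columns are encoded with `drop_first=True` to avoid a redundant column per feature. The three-row frame is hypothetical; only the column names come from the data description.

```python
import pandas as pd

# Hypothetical rows; column names follow the data description.
X = pd.DataFrame(
    {
        "Gender": ["M", "F", "F"],
        "Card_Category": ["Blue", "Gold", "Blue"],
        "Customer_Age": [45, 52, 38],
    }
)

# drop_first=True drops the first (alphabetical) level of each column,
# so a k-level category becomes k-1 indicator columns.
X_encoded = pd.get_dummies(X, columns=["Gender", "Card_Category"], drop_first=True)
print(list(X_encoded.columns))
```

If the train and test splits are encoded separately, their columns may differ when a level is absent from one split, so it is worth checking that both end up with the same columns.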

Building the model

Model evaluation criterion:

The model can make wrong predictions as follows:

Problem Statement: Customers leaving the credit card service would cause the bank a loss, so the bank wants to analyze its customer data, identify the customers who will leave the credit card service, and understand their reasons, so that it can improve in those areas.

  1. Predicting that a customer will stay when the customer actually leaves (Attrited). This will be a loss for the credit card company.

Which case is more important?

How can this loss be reduced, i.e. how can False Negatives be reduced?
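
Reducing false negatives corresponds to maximizing recall for the attrited class, where recall = TP / (TP + FN). The sketch below uses made-up labels to show the computation. It assumes the encoding stated earlier in this notebook (Existing Customer = 1, Attrited Customer = 0), which means the attrited class has to be named explicitly through `pos_label=0`; scikit-learn would otherwise treat class 1 as positive.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 0 = Attrited (class of interest), 1 = Existing.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

# labels=[1, 0] orders the matrix as [negative, positive] with
# Attrited (0) as the positive class, so ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()

# Recall for the attrited class: share of actual leavers that were caught.
recall_attrited = recall_score(y_true, y_pred, pos_label=0)
print(tp, fn, recall_attrited)
```

Here one of the four attrited customers is missed, so recall for the attrited class is 3 / 4.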

We are building the following models

a) Logistic regression without tuning
b) Decision tree without tuning
c) Bagging using a decision tree as base estimator, without tuning
d) Random forest classifier without tuning
e) AdaBoost without tuning
f) Gradient boosting without tuning

Logistic Regression

Let's evaluate the model performance by using KFold and cross_val_score
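
A minimal sketch of this evaluation step, assuming five shuffled folds and recall as the scoring metric. The data comes from `make_classification` with an arbitrary class imbalance rather than from the bank data set, and the fold count, `max_iter`, and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data (hypothetical).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=1
)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# scoring="recall" measures recall for the class labelled 1;
# one score is returned per fold.
scores = cross_val_score(model, X, y, scoring="recall", cv=kfold)
print(scores, scores.mean())
```

With an imbalanced target, `StratifiedKFold` is a common alternative to `KFold` because it keeps the class ratio roughly constant across folds. Note also that `scoring="recall"` targets class 1, so under an encoding where the attrited class is 0, a custom scorer built with `make_scorer(recall_score, pos_label=0)` would be needed instead.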

Oversampling train data using SMOTE

Logistic Regression on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score

Undersampling train data using Random Under Sampler

Logistic Regression on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

We can see that the model trained on undersampled data is best; let's check its performance on the test data

Finding the coefficients

Coefficient interpretations

Converting coefficients to odds
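
In logistic regression a coefficient b is a change in log-odds, so a one-unit increase in the feature multiplies the odds of the positive class by e^b, and the percentage change in odds is (e^b - 1) * 100. The sketch below applies this to two made-up coefficients; the values are hypothetical and are not results from the fitted model.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients keyed by feature name (not fitted values).
coefs = pd.Series({"Total_Trans_Ct": -0.10, "Months_Inactive_12_mon": 0.50})

# Odds ratio: multiplicative change in odds per one-unit increase.
odds = np.exp(coefs)

# Same information expressed as a percentage change in odds.
pct_change = (odds - 1) * 100

print(pd.DataFrame({"odds": odds, "pct_change_in_odds": pct_change}))
```

An odds ratio above 1 means the feature raises the odds of the class labelled 1, and a value below 1 means it lowers them. For a fitted scikit-learn model the coefficients would come from `model.coef_[0]`.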

Decision tree

Let's evaluate the model performance by using KFold and cross_val_score

Decision Tree on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score

Undersampling train data using Random Under Sampler

Decision Tree on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

We can see that the model trained on undersampled data is best; let's check its performance on the test data

Bagging Classifier

Let's evaluate the model performance by using KFold and cross_val_score

Bagging Classifier on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score

Undersampling train data using Random Under Sampler

Bagging Classifier on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

We can see that the model trained on undersampled data is best; let's check its performance on the test data

Random Forest Classifier

Let's evaluate the model performance by using KFold and cross_val_score

Random Forest Classifier on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score

Undersampling train data using Random Under Sampler

Random Forest Classifier on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

We can see that the model trained on undersampled data is best; let's check its performance on the test data

XGB Classifier

Let's evaluate the model performance by using KFold and cross_val_score

XGB Classifier on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score

Undersampling train data using Random Under Sampler

XGB Classifier on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

We can see that the model trained on undersampled data is best; let's check its performance on the test data

Gradient Boost Classifier

Let's evaluate the model performance by using KFold and cross_val_score

Gradient Boost Classifier on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score

Undersampling train data using Random Under Sampler

Gradient Boost Classifier on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

We can see that the model trained on undersampled data is best; let's check its performance on the test data

Final Model Selection

Business Insights and Recommendations